
    Personalized Predictive ASR for Latency Reduction in Voice Assistants

    Full text link
    Streaming Automatic Speech Recognition (ASR) in voice assistants can utilize prefetching to partially hide the latency of response generation. Prefetching involves passing a preliminary ASR hypothesis to downstream systems in order to prefetch and cache a response. If the final ASR hypothesis after endpoint detection matches the preliminary one, the cached response can be delivered to the user, thus saving latency. In this paper, we extend this idea by introducing predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance, and prefetch the response based on the predicted utterance. We introduce two personalization approaches and investigate the tradeoff between potential latency gains from successful predictions and the cost increase from failed predictions. We evaluate our methods on an internal voice assistant dataset as well as the public SLURP dataset. Comment: Accepted for Interspeech 202
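    A minimal sketch of the prefetching flow described above, assuming hypothetical `predict_full_utterance` and `generate_response` placeholders for the personalized prediction model and the downstream response generator; keying the cache on an exact string match is an illustrative simplification, not the paper's implementation.

```python
# Hypothetical sketch of response prefetching from a predicted utterance.

def predict_full_utterance(partial_text: str) -> str:
    # Placeholder: a real system would use a personalized prediction model here.
    return partial_text + " in the living room"

def generate_response(utterance: str) -> str:
    # Placeholder for the (expensive) downstream response generation.
    return f"OK, handling: {utterance}"

response_cache: dict[str, str] = {}

def on_partial_hypothesis(partial_text: str) -> None:
    """Called whenever the streaming ASR emits a preliminary hypothesis."""
    predicted = predict_full_utterance(partial_text)
    if predicted not in response_cache:
        # Prefetch: generate and cache the response before endpoint detection.
        response_cache[predicted] = generate_response(predicted)

def on_final_hypothesis(final_text: str) -> str:
    """Called after endpoint detection with the final ASR hypothesis."""
    if final_text in response_cache:
        return response_cache[final_text]   # prediction matched: latency saved
    return generate_response(final_text)    # prediction failed: pay full latency

if __name__ == "__main__":
    on_partial_hypothesis("turn on the lights")
    print(on_final_hypothesis("turn on the lights in the living room"))
```

    A failed prediction only wastes the prefetched computation; the final hypothesis still receives a response through the normal path, which is the cost side of the tradeoff the paper investigates.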

    Robust Large Vocabulary Continuous Speech Recognition Using Missing Data Techniques (Robuuste spraakherkenning voor groot vocabularium gebruik makend van de techniek van de ontbrekende data)

    No full text
    The opportunities to integrate speech recognition into our daily lives keep growing. With the rising popularity of devices such as mobile phones, computers, music players and navigation systems, the demand for applications that can be controlled by the human voice has increased considerably in recent years. Essential for the practical deployment of speech recognition in these systems, however, is robustness against the adverse effect of unknown background noise. Unlike human listeners, automatic speech recognition systems are extremely sensitive to time-varying background noise. This is due to the mismatch between the noise-free conditions under which the statistical models of speech are trained and the noisy conditions to which these systems are exposed in practice. Without techniques that reduce this mismatch, the accuracy of the recognizer degrades considerably. The primary goal of this doctoral study is to make the speech recognition system noise robust with a technique based on reconstructing the missing data, known as Missing Feature Theory (MFT). In an MFT-based recognizer, a spectral mask indicates which regions of the time-frequency representation of the corrupted speech signal are dominated by background noise and classifies them as unreliable. These regions are treated as missing in the remainder of the recognition process. Given a correct classification, MFT has great potential to perform accurate speech recognition on the remaining information in the corrupted speech signal. Unlike most other noise compensation methods, MFT moreover has the important advantage that its performance is independent of the type of background noise. In this work, the missing components of the feature vectors computed from the speech are estimated with the so-called data imputation technique. This technique can be used in any feature domain that is a linear transformation of the log-spectral domain and is applicable to both static and dynamic feature vectors. Two new masking methods were developed and evaluated, for masks that either make a yes/no decision about the reliability of the data or estimate a probability for it. A method that corrects for differences that may occur in the communication channel was also integrated into the recognition procedure. The result of this doctoral work is an MFT-based speech recognizer that is robust against a wide range of background noises and variations in microphone and channel characteristics. The system was tested on two standard databases: a small-vocabulary digit recognition database (Aurora2) and a large-vocabulary dictation task (Aurora4). With a minimum of assumptions about the background noise, the developed system achieves a recognition accuracy that is among the best published results on both databases. Van Segbroeck M., ''Robust large vocabulary continuous speech recognition using missing data techniques'', PhD thesis in engineering science, K.U.Leuven, January 2010, Leuven, Belgium.
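    A rough illustration of the missing-data idea, assuming log mel-spectral features and a pre-computed noise estimate; the SNR threshold and the mean-based imputation are illustrative simplifications, not the mask estimation or imputation methods developed in the thesis.

```python
import numpy as np

def estimate_binary_mask(noisy_logspec, noise_logspec, snr_threshold_db=0.0):
    """Mark time-frequency cells dominated by speech as reliable (1.0) and
    cells dominated by noise as unreliable/missing (0.0).

    noisy_logspec, noise_logspec: (frames, bands) natural-log mel energies of
    the noisy signal and of a noise estimate (assumed available)."""
    local_snr_db = 10.0 / np.log(10.0) * (noisy_logspec - noise_logspec)
    return (local_snr_db > snr_threshold_db).astype(float)

def impute_missing(noisy_logspec, mask, clean_band_means):
    """Crude imputation: replace unreliable cells by a clean-speech band mean,
    capped by the observed noisy energy, since the noisy value is an upper
    bound on the underlying clean speech energy."""
    fill = np.minimum(clean_band_means, noisy_logspec)  # broadcast (bands,) over frames
    return np.where(mask > 0.5, noisy_logspec, fill)
```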

    Unsupervised learning of time-frequency patches as a noise-robust representation of speech

    No full text
    We present a self-learning algorithm using a bottom-up approach to automatically discover, acquire and recognize the words of a language. First, an unsupervised technique using non-negative matrix factorization (NMF) discovers phone-sized time–frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed from static and dynamic speech features using a spectral representation of both short and long acoustic events. By describing speech in terms of the discovered time–frequency patches, patch activations are obtained which express to what extent each patch is present across time. We then show that speaker-independent patterns recur in these patch activations and how they can be discovered by applying a second NMF-based algorithm to the co-occurrence counts of activation events. By providing information about the word identity to the learning algorithm, the retrieved patterns can be associated with meaningful objects of the language. On a small vocabulary task, the system learns patterns corresponding to words and subsequently detects the presence of these words in speech utterances. Without the prior expert knowledge about speech that conventional automatic speech recognition requires, we illustrate that the learning algorithm achieves promising accuracy and noise robustness. Van Segbroeck M., Van hamme H., ''Unsupervised learning of time-frequency patches as a noise-robust representation of speech'', Speech Communication, vol. 51, no. 11, pp. 1124-1138, November 2009.
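    A toy sketch of the first NMF stage using scikit-learn; the feature matrix construction, dimensions and number of patches are placeholders rather than the paper's configuration, and the second NMF on co-occurrence counts of activation events is omitted.

```python
import numpy as np
from sklearn.decomposition import NMF

# V: nonnegative matrix whose columns are (static + dynamic) spectral feature
# vectors, one column per frame; random data stands in for real speech here.
rng = np.random.default_rng(0)
V = rng.random((600, 2000))         # 600 feature dimensions x 2000 frames

n_patches = 50                      # number of time-frequency patches to discover
model = NMF(n_components=n_patches, init="nndsvda", max_iter=400, random_state=0)
W = model.fit_transform(V)          # columns of W: discovered patches (600 x 50)
H = model.components_               # rows of H: patch activations over time (50 x 2000)
assert (W @ H).shape == V.shape     # V is approximated by the product W @ H
```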

    Advances in missing feature techniques for robust large vocabulary continuous speech recognition

    No full text
    Missing feature theory (MFT) has demonstrated great potential for improving noise robustness in speech recognition. MFT has mostly been applied in the log-spectral domain, since this is also the representation in which the masks have a simple formulation. However, with diagonally structured covariance matrices in the log-spectral domain, recognition performance can only be maintained at the cost of drastically increasing the number of Gaussians. In this paper, MFT is applied to static and dynamic features in any feature domain that is a linear transform of log-spectra. A crucial part of MFT systems is the computation of reliability masks from noisy data. The proposed system operates on either binary masks, where a hard decision is made about the reliability of each component, or fuzzy masks, which use a soft decision criterion. For real-life deployments, compensation for convolutional noise is also required. Channel compensation in speech recognition typically involves estimating an additive shift in the log-spectral or cepstral domain. To deal with the fact that some features are unreliable, a maximum-likelihood estimation technique is integrated into the back-end recognition process of the MFT system to estimate the channel. Hence, the resulting MFT-based recognizer can deal with both additive and convolutional noise and shows promising results on the Aurora4 large-vocabulary database. © 2010 IEEE. Van Segbroeck M., Van hamme H., ''Advances in missing feature techniques for robust large vocabulary continuous speech recognition'', IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 123-137, January 2011.
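    A deliberately simplified view of why a linear-transform feature domain helps, assuming log mel-spectra and a DCT to cepstra; the paper's per-Gaussian formulation in the transformed domain is not reproduced here.

```python
import numpy as np
from scipy.fftpack import dct

def logspec_to_cepstra(logspec, n_ceps=13):
    """Apply the (linear) DCT that maps log mel-spectra to cepstra.

    logspec: (frames, bands) array, e.g. the output of a missing-data repair
    step in the log-spectral domain; because the transform is linear, any
    constraint on reliable/imputed log-spectral components carries over to
    the cepstral features."""
    return dct(np.asarray(logspec), type=2, axis=1, norm="ortho")[:, :n_ceps]
```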

    Handling convolutional noise in missing data automatic speech recognition

    No full text
    Missing Data Techniques have already shown their effectiveness in dealing with additive noise in automatic speech recognition systems. For real-life deployments, a compensation for linear filtering distortions is also required. Channel compensation in speech recognition typically involves estimating an additive shift in the log-spectral or cepstral domain. This paper explores a maximum likelihood technique to estimate this model offset while some data are missing. Recognition experiments on the Aurora2 recognition task demonstrate the effectiveness of this technique. In particular, we show that our method is more accurate than previously published methods and can handle narrow-band data.
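    A crude stand-in for the channel estimation idea: a convolutional channel becomes an additive offset in the log-spectral domain, so the offset can be estimated from the observed-minus-reference difference over reliable cells only. The maximum-likelihood formulation of the paper is replaced here by a simple masked mean; the reference log-spectra (e.g. decoded clean-speech means) are assumed given.

```python
import numpy as np

def estimate_channel_offset(noisy_logspec, reference_logspec, mask):
    """Per-band additive channel offset, averaged over reliable cells only.

    noisy_logspec, reference_logspec, mask: (frames, bands) arrays; mask is 1.0
    where a cell is reliable and 0.0 where it is treated as missing."""
    diff = (noisy_logspec - reference_logspec) * mask
    counts = np.maximum(mask.sum(axis=0), 1.0)   # avoid division by zero
    return diff.sum(axis=0) / counts             # (bands,) offset estimate
```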

    Vector-Quantization based Mask Estimation for Missing Data Automatic Speech Recognition

    No full text
    The application of Missing Data Theory (MDT) has been shown to improve the robustness of automatic speech recognition (ASR) systems. A crucial part of an MDT-based recognizer is the computation of the reliability masks from noisy data. To estimate accurate masks in environments with unknown, non-stationary noise statistics, only weak assumptions can be made about the noise, and we need to rely on a strong model of the speech. In this paper, we present a missing data detector that uses harmonicity in the noisy input signal and a vector quantizer (VQ) to confine speech models to a subspace. The resulting system can deal with additive and convolutional noise and shows promising results on the Aurora4 large vocabulary database. Index Terms: speech recognition, noise robustness, missing data mask estimation, speech separation
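    A rough sketch of the vector-quantization part, assuming a KMeans codebook trained on clean log mel-spectra; the harmonicity cue and the actual mask criterion of the paper are not reproduced, and the margin parameter is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(clean_logspec, n_codewords=256):
    """VQ codebook of clean-speech log mel-spectra (frames x bands)."""
    return KMeans(n_clusters=n_codewords, n_init=4, random_state=0).fit(clean_logspec)

def vq_mask(noisy_logspec, codebook, margin=3.0):
    """Flag cells where the noisy observation lies well above the closest
    clean codeword as noise-dominated (0.0), otherwise reliable (1.0)."""
    noisy = np.asarray(noisy_logspec)
    nearest = codebook.cluster_centers_[codebook.predict(noisy)]
    return (noisy - nearest < margin).astype(float)
```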

    Robust speech recognition using missing data techniques in the PROSPECT domain and fuzzy masks

    No full text
    Missing data theory (MDT) has been applied to address the problem of noise-robust speech recognition. Conventional MDT systems require acoustic models that are expressed in the log-spectral rather than the cepstral domain, which leads to a loss in accuracy. Therefore, we previously introduced an MDT technique that can be applied in any feature domain that is a linear transform of log-spectra. This MDT system requires hard decisions about the reliability of each spectral component. When the mask is computed from noisy data, misclassification errors are hardly avoidable and the recognition rate degrades significantly. The risk of misclassification can be reduced by estimating a probability that the component is reliable, yielding a fuzzy mask. In this paper, we extend our MDT system to this probabilistic decision framework. Experiments on the Aurora2 database demonstrate a further increase in recognition accuracy, especially at low SNRs. Van Segbroeck M., Van hamme H., ''Robust speech recognition using missing data techniques in the PROSPECT domain and fuzzy masks'', Proceedings IEEE international conference on acoustics, speech, and signal processing - ICASSP'2008, pp. 4393-4396, March 30 - April 4, 2008, Las Vegas, Nevada, USA.
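    A minimal sketch of a fuzzy (soft) mask, assuming a local SNR estimate derived from noisy and noise log mel-spectra; the sigmoid mapping and its parameters are illustrative, not the estimator used in the paper.

```python
import numpy as np

def fuzzy_mask(noisy_logspec, noise_logspec, center_db=0.0, slope=0.5):
    """Soft reliability mask: instead of a hard 0/1 decision, map the local
    SNR through a sigmoid so each time-frequency cell gets a probability of
    being reliable."""
    local_snr_db = 10.0 / np.log(10.0) * (noisy_logspec - noise_logspec)
    return 1.0 / (1.0 + np.exp(-slope * (local_snr_db - center_db)))
```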